Chapter 1: Start me up!

About the project

Write a short description about the course and add a link to your GitHub repository here. This is an R Markdown (.Rmd) file so you should use R Markdown syntax.
   The aim of the course is to teach how to use R, RStudio and GitHub as well as how to do statistical analyses with R.
   GitHub repository: https://github.com/phuoc-tn/IODS-project

# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Tue Dec 13 04:22:21 2022"
paste("Today I stepped on", sample(1:100, size = 1, replace = FALSE), "LEGO brick(s).")
## [1] "Today I stepped on 72 LEGO brick(s)."


How are you feeling right now?
   Quite tired.

What do you expect to learn?
   How/when/why to do certain statistical analyses. Maybe new tricks to use in R.

Where did you hear about the course?
   My doctoral school informed me in an email.

How did it work as a “crash course” on modern R tools and using RStudio?
   Quite intensive.

Which were your favorite topics?
   Data parsing, using partial matches to find columns.

Which topics were most difficult?
   Transposing data frames.

Some other comments on the book and our new approach of getting started with R Markdown etc.?
   The online book is a great resource! I will definitely bookmark it for future use.


Chapter 2: Regression and model validation

Describe the work you have done this week and summarize your learning.

The main highlight of this week was learning how to run, (somewhat) understand, and interpret linear regression models and their results in R. I gained more insight into identifying and validating variables and models that explain observed data. Let’s hope I get a chance to use these skills in the future.

date()
## [1] "Tue Dec 13 04:22:21 2022"

Analysis exercises

Task 1.

Explore the structure and the dimensions of the data and describe the dataset briefly, assuming the reader has no previous knowledge of it.

1) Load tidyverse.

library(tidyverse)

2) Import data from CSV file created in data wrangling.

learning2014 <- read_csv(file = "Data/learning2014.csv", col_names = TRUE)

3) Check structure and dimensions of data.

str(learning2014)
## spc_tbl_ [166 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ gender  : chr [1:166] "F" "M" "F" "M" ...
##  $ age     : num [1:166] 53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num [1:166] 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num [1:166] 3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num [1:166] 3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num [1:166] 2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : num [1:166] 25 12 24 10 22 21 21 31 24 26 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   gender = col_character(),
##   ..   age = col_double(),
##   ..   attitude = col_double(),
##   ..   deep = col_double(),
##   ..   stra = col_double(),
##   ..   surf = col_double(),
##   ..   points = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
dim(learning2014)
## [1] 166   7

Answer: The data come from a statistical study of how students’ learning approaches and strategies relate to their achievement in an introductory statistics course taught at the University of Helsinki, Finland. This dataset was sampled from the study’s original data and consists of seven variables (gender, age, attitude, deep, stra, surf, and points) measured for 166 observations collected via a questionnaire. The attitude variable (adjusted by dividing the original attitude scores by ten) describes the students’ attitude toward learning statistics, measured with ten statements. The variables deep, stra, and surf abbreviate the deep (aim to engage, learn, and understand the course material), strategic (aim to maximize points by any approach), and surface (aim to pass with minimal effort) learning approaches, respectively, and were measured on a 1–5 scale (1 = disagree, 5 = agree).

Task 2.

Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them.

1.1) Use pairs() to visualize data.

pairs(learning2014[-1],
      col = c("red", "blue")[factor(learning2014$gender)], # Use gender for colors.
      oma = c(2, 2, 2, 10)) # Set the margins outside of plot.
par(xpd = TRUE) # Allow legend to be outside of plot area.
legend(0.95, 0.6, # Coordinates for legend position.
       legend = unique(learning2014$gender),
       fill = c("red", "blue"))

Getting colors and legend with pairs() was harder than expected. Maybe too hard…

1.2) Or use ggplot2 and GGally to visualize data.

library(tidyverse) # ggplot2 is part of tidyverse.
library(GGally)

ggpairs(
  learning2014,
  mapping = aes(col = gender, alpha = 0.3),
  lower = list(combo = wrap("facethist", bins = 20))
) +
  scale_color_manual(values = c("red", "blue")) + # Assign new colors manually.
  scale_fill_manual(values = c("red", "blue")) + # Same here.
  theme_bw() + # Change default theme to something more clean.
  theme(strip.text = element_text(size = 14)) # Adjust panel title sizes.

2) Summarize data.

learning2014$gender <- as.factor(learning2014$gender) # Changed gender from character to factor for summary.
summary(learning2014)
##  gender       age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :1.400   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :3.200   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :3.143   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :5.000   Max.   :4.917   Max.   :5.000  
##       surf           points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00

Answer: The age of the students ranged from 17 to 55, with most being ca. 20 years old. Based on the results, age does not seem to correlate with the number of points acquired during the statistics course. Although participating female students (n = 110) outnumbered males (n = 56) by 2:1, there were minimal differences in points acquired between the two groups. Males, however, scored higher in attitude than females on average.
   A significant positive correlation was found between attitude towards learning statistics and points, indicating that the former is a good predictor of the latter. The individual learning approaches (deep, surface, and strategic) did not correlate strongly with points. Among them, however, the strategic approach seems to correlate the most positively with higher points, which aligns with the general aim of the approach: maximize the number of points by any means. This approach was also slightly more preferred by females than by males, on average.
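The correlations read off the plot can also be checked numerically; a minimal sketch, assuming the learning2014 data frame from the chunk above is still in the workspace:

```r
# Correlation of each numeric variable with exam points (gender is
# excluded because it is a character/factor column).
cor(learning2014[, c("age", "attitude", "deep", "stra", "surf")],
    learning2014$points) |>
  round(digits = 2)
```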

Task 3.

Choose three variables as explanatory variables and fit a regression model where exam points is the target (dependent, outcome) variable. Show a summary of the fitted model and comment and interpret the results. Explain and interpret the statistical test related to the model parameters. If an explanatory variable in your model does not have a statistically significant relationship with the target variable, remove the variable from the model and fit the model again without it.

1) Fit the learning approaches to linear model.

linear_model <- lm(points ~ deep + stra + surf, data = learning2014)

2) Check summary of linear model.

summary(linear_model)
## 
## Call:
## lm(formula = points ~ deep + stra + surf, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.1235  -3.0737   0.5226   4.2799  10.3229 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  26.9143     5.1169   5.260  4.5e-07 ***
## deep         -0.7443     0.8662  -0.859   0.3915    
## stra          0.9878     0.5962   1.657   0.0994 .  
## surf         -1.6296     0.9153  -1.780   0.0769 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.827 on 162 degrees of freedom
## Multiple R-squared:  0.04071,    Adjusted R-squared:  0.02295 
## F-statistic: 2.292 on 3 and 162 DF,  p-value: 0.08016

3) Remove the approaches, as none of them has a statistically significant relationship with exam points, and refit the model as instructed.

new_linear_model <- lm(points ~ 1, data = learning2014)
summary(new_linear_model)
## 
## Call:
## lm(formula = points ~ 1, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.7169  -3.7169   0.2831   5.0331  10.2831 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  22.7169     0.4575   49.65   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.895 on 165 degrees of freedom

Answer: In short, a linear regression model tests the association between the predictor variables, in my case the three learning approaches (deep, strategic, and surface), and the predicted outcome, i.e. the exam points. A cursory look at the summary table and the p-values in the Coefficients section shows that none of the three approaches is statistically significant, meaning that no significant association between them and the exam points was detected.
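An alternative, more incremental route (a sketch of what could be done, not what was done above) is to drop only the least significant predictor, deep, and refit before deciding about the others; this assumes learning2014 is loaded as before:

```r
# Drop only `deep` (largest p-value above) and refit; `stra` and `surf`
# can then be re-evaluated before removing them as well.
smaller_model <- lm(points ~ stra + surf, data = learning2014)
summary(smaller_model)
```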

Task 4.

Using a summary of your fitted model, explain the relationship between the chosen explanatory variables and the target variable (interpret the model parameters). Explain and interpret the multiple R-squared of the model.

Answer: The three learning approaches are not significant explanatory variables for the exam points. This can be seen, for example, in the previously mentioned Coefficients section, which shows the correlation and significance of the selected predictors (learning approaches in this case) and the exam points. The statistical significance of the predictors is shown with different symbols next to the p-values, and their corresponding values are shown immediately below (Signif. codes). The overall significance of the model — or the lack thereof — can also be seen in the F-statistic section.
   The estimates in the Coefficients section suggest (interestingly enough) that only the strategic approach correlates positively with higher exam points. This makes sense, assuming students use deep and/or surface learning approaches depending on how familiar they are with certain topics in statistics in order to maximize points.
   The multiple R2 value shows how much of the variance in the outcome the predictor variables explain in our dataset; here the three approaches explain only ca. 4.1% of the variance. The adjusted R2 value, as the name suggests, is adjusted so that adding more variables does not automatically yield a higher R2. In this model, only 2.3% of the variance in exam points can be explained by the three approaches.
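The multiple R-squared can be verified by hand as 1 - RSS/TSS; a self-contained sketch on the built-in mtcars data (not the course data), with formula and variables of my own choosing:

```r
# R-squared = 1 - RSS/TSS, checked against the value summary() reports.
fit <- lm(mpg ~ wt + hp, data = mtcars)
rss <- sum(residuals(fit)^2)                   # residual sum of squares
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)  # total sum of squares
isTRUE(all.equal(1 - rss / tss, summary(fit)$r.squared))  # TRUE
```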

Task 5.

Produce the following diagnostic plots: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage. Explain the assumptions of the model and interpret the validity of those assumptions based on the diagnostic plots.

1) Generate requested plots from the linear model with the three learning approaches.

par(mfrow = c(2, 2),
    mar = c(2, 2, 2, 2)) # Decrease margins around plots.
plot(linear_model, which = c(1, 2, 5), lwd = 2) # lwd = line width.

Answer: The Residuals vs Fitted plot shows whether the relationship between the fitted values and their residuals is linear. Comparing the red line to the dashed grey line shows that the linearity in my model is not the greatest. The Normal QQ-plot shows whether the residuals of the model follow a normal distribution; here they mostly do, apart from the tails. The Residuals vs Leverage plot shows which data points are “influential”, i.e. which points significantly affect the results of the linear model. Data points outside the Cook’s distance (the grey dashed lines) would be influential. Because no data points in this model fall outside those lines, none of them are influential.
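The visual check for influential points can be complemented numerically with cooks.distance(); a sketch assuming linear_model from Task 3 is still around (the 4/n cutoff is a common rule of thumb, not part of the course material):

```r
# Flag observations whose Cook's distance exceeds the 4/n rule of thumb;
# an empty result means no observation is flagged as influential.
cooks <- cooks.distance(linear_model)
which(cooks > 4 / length(cooks))
```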


Chapter 3: Logistic regression

date()
## [1] "Tue Dec 13 04:22:34 2022"

Analysis exercises

Task 2.

The CSV file for this week’s assignment contains modified data from a study by Prof. Paulo Cortez and Alice Silva, published in 2008 (source). The table holds measurements (obtained via school reports and questionnaires) of variables related to students’ performance in mathematics and Portuguese in secondary education at two Portuguese schools (Gabriel Pereira and Mousinho da Silveira). The following attributes are found in the dataset:

library(tidyverse)

student_alc <- read_csv(file = "https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv", col_names = TRUE)
colnames(student_alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "failures"   "paid"       "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

The “high_use” column is a derived variable: it is TRUE when “alc_use” is greater than two and FALSE otherwise.
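The derivation presumably happened in the data-wrangling step along these lines (a sketch; averaging the Dalc and Walc columns into alc_use is my assumption based on the column names, not something shown above):

```r
# Hypothetical reconstruction: average workday (Dalc) and weekend (Walc)
# alcohol use, then flag high use when the average exceeds 2.
student_alc <- student_alc %>%
  mutate(alc_use  = (Dalc + Walc) / 2,
         high_use = alc_use > 2)
```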

Task 3.

Due to the file containing so many interesting variables to choose from, I decided to let R — and fate — choose the ones I will be using for the rest of the exercises. These were:

set.seed(2135) # Seed number for reproducibility, when using random sampling. 

interesting_variables <- sample(colnames(student_alc), size = 4, replace = FALSE)
interesting_variables
## [1] "freetime"   "sex"        "traveltime" "Pstatus"

Free time after school could be connected to high alcohol usage: the more time you have after school, especially as a student aged 15 to 22, the more time you have for a drink (or two). Sex might also be connected, because men typically drink more than women. Travel time, which is specified as the time from home to school, is harder to imagine being related to high alcohol consumption. Maybe Portuguese students drink during long trips to school? Who knows. ¯\_(ツ)_/¯ Lastly, the parents’ cohabitation status (together or apart) might correlate negatively, because children in two-parent households have both parents looking after them, and thus fewer opportunities or reasons to drink.

Task 4.

library(RColorBrewer)
library(patchwork)
library(gridExtra)

interesting_variables <- append(interesting_variables, "high_use")

student_alc %>%
  select(all_of(interesting_variables)) %>% gather() %>%
ggplot(aes(value, fill = value)) +
  geom_bar() +
  scale_fill_brewer(palette = "Paired", guide = "none") +
  theme_bw() +
  theme(strip.text = element_text(size = 11, face = "bold"),
        axis.title = element_text(size = 11, face = "bold")) +
  facet_wrap("key", scales = "free")

Looking at the bar plots for the interesting variables, I notice that the observations for free time are roughly normally distributed. Most of the students don’t have high alcohol usage. Interestingly, the parents’ cohabitation status is quite skewed, with most parents living together. The sex distribution is almost equal, and most students have a short travel time to school.
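The skew in cohabitation status noted above can be quantified with a quick count, assuming student_alc from the chunk above:

```r
# Counts per parental cohabitation status: A = apart, T = together.
table(student_alc$Pstatus)
```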

library(ggpubr)

new_student_alc <- student_alc %>%
  select(all_of(interesting_variables))

plot_a <-
  ggplot(new_student_alc, aes(x = high_use, y = freetime, fill = sex)) +
  geom_boxplot() +
  scale_fill_brewer(
    palette = "Dark2",
    name = "Sex",
    labels = c("Female", "Male")
  ) +
  scale_y_discrete(limits = c("very low", "low", "medium", "high", "very high")) +
  ylab("Free time\n(after school)") +
  theme_bw() +
  theme(
    axis.title.x = element_blank(),
    axis.title = element_text(face = "bold"),
    legend.title = element_text(face = "bold")
  )

plot_b <-
  ggplot(new_student_alc, aes(x = high_use, y = traveltime, fill = sex)) +
  geom_boxplot() +
  scale_fill_brewer(
    palette = "Dark2",
    name = "Sex",
    labels = c("Female", "Male")
  ) +
  scale_y_discrete(limits = c("<15 min", "15–30 min", "30–60 min", ">60 min")) +
  ylab("Travel time\n(home to school)") +
  theme_bw() +
  theme(
    axis.title.x = element_blank(),
    axis.title = element_text(face = "bold"),
    legend.title = element_text(face = "bold")
  )

plot_c <-
  ggplot(new_student_alc, aes(x = high_use, y = freetime, fill = Pstatus)) +
  geom_boxplot() +
  scale_fill_brewer(
    palette = "Paired",
    name = "Parent's\ncohabitation\nstatus",
    labels = c("Apart", "Together")
  ) +
  scale_y_discrete(limits = c("very low", "low", "medium", "high", "very high")) +
  ylab("Free time\n(after school)") +
  theme_bw() +
  theme(
    axis.title.x = element_blank(),
    axis.title = element_text(face = "bold"),
    legend.title = element_text(face = "bold")
  )

plot_d <-
  ggplot(new_student_alc, aes(x = high_use, y = traveltime, fill = Pstatus)) +
  geom_boxplot() +
  scale_fill_brewer(
    palette = "Paired",
    name = "Parent's\ncohabitation\nstatus",
    labels = c("Apart", "Together")
  ) +
  scale_y_discrete(limits = c("<15 min", "15–30 min", "30–60 min", ">60 min")) +
  ylab("Travel time\n(home to school)") +
  theme_bw() +
  theme(
    axis.title.x = element_blank(),
    axis.title = element_text(face = "bold"),
    legend.title = element_text(face = "bold")
  )

joined_plot <- (plot_a + plot_b + plot_layout(guides = "collect")) /
  (plot_c + plot_d + plot_layout(guides = "collect")) +
  plot_annotation(tag_levels = "A") &
  theme(plot.tag = element_text(face = "bold"))

grid.arrange(
  patchworkGrob(joined_plot),
  bottom = text_grob("High alcohol consumption", face = "bold")
)

Looking at the box plots, travel time shows little difference in alcohol consumption for either sex or the parents’ cohabitation status (panels B and D). With free time (panels A and C), however, there seems to be a small increase in consumption with more free time, which somewhat aligns with my assumptions. The increase also appears larger on average for males (panel A), in line with my initial hypotheses. Curiously, I notice that students with separated parents have, on average, less free time but higher alcohol consumption, which contradicts my hypotheses.

Task 5.

model <- glm(high_use ~ freetime + sex + traveltime + Pstatus - 1, data = student_alc, family = "binomial")
summary(model)
## 
## Call:
## glm(formula = high_use ~ freetime + sex + traveltime + Pstatus - 
##     1, family = "binomial", data = student_alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3632  -0.8695  -0.6220   1.1984   1.9127  
## 
## Coefficients:
##            Estimate Std. Error z value Pr(>|z|)    
## freetime     0.3023     0.1253   2.413   0.0158 *  
## sexF        -2.6603     0.5946  -4.474 7.69e-06 ***
## sexM        -1.8937     0.6263  -3.023   0.0025 ** 
## traveltime   0.4015     0.1617   2.483   0.0130 *  
## PstatusT    -0.1924     0.3912  -0.492   0.6228    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 512.93  on 370  degrees of freedom
## Residual deviance: 424.31  on 365  degrees of freedom
## AIC: 434.31
## 
## Number of Fisher Scoring iterations: 4

The summary shows that Pstatus might not be statistically significant in my model.

OR <- coef(model) %>% exp
CI <- confint(model) %>% exp
## Waiting for profiling to be done...
cbind(OR, CI)
##                    OR      2.5 %    97.5 %
## freetime   1.35291603 1.06175190 1.7370368
## sexF       0.06992947 0.02089100 0.2168783
## sexM       0.15051787 0.04259748 0.5008024
## traveltime 1.49409723 1.08919695 2.0594122
## PstatusT   0.82496011 0.38967324 1.8280994

Pstatus has the widest confidence interval. Its lower bound is below 1 and its upper bound above 1, meaning no statistically significant association with high alcohol usage; this is consistent with the model summary above. Free time and travel time have statistically significant positive associations (both interval bounds > 1), which aligns with my assumptions, while sex has a negative one (both bounds < 1), which contradicts them.
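These odds ratios are just the exponentiated model coefficients; a quick check, assuming `model` from the chunk above:

```r
# exp() maps log-odds coefficients to odds ratios; the freetime value
# should match the OR column in the table above.
exp(coef(model)["freetime"])
```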

library(finalfit)

factor_student_alc <-
  new_student_alc %>%
  mutate(sex = factor(sex) %>%
           fct_recode("Female" = "F",
                      "Male"   = "M") %>%
           ff_label("Sex")) %>%
  mutate(
    Pstatus = factor(Pstatus) %>%
      fct_recode("Apart" = "A",
                 "Together"   = "T") %>%
      ff_label("Parent's cohabitation status")
  ) %>%
  mutate(
    traveltime = factor(traveltime) %>%
      fct_recode(
        "<15 min" = "1",
        "15–30 min" = "2",
        "30–60 min" = "3",
        ">60 min" = "4"
      ) %>%
      ff_label("Travel time (home to school)")
  ) %>%
  mutate(
    freetime = factor(freetime) %>%
      fct_recode(
        "very low" = "1",
        "low" = "2",
        "medium" = "3",
        "high" = "4",
        "very high"   = "5"
      ) %>%
      ff_label("Free time (after school)")
  ) %>%
  mutate(high_use = ff_label(high_use, "High alcohol consumption"))

factor_student_alc %>%
  summary_factorlist(
    "high_use",
    c("freetime", "sex", "traveltime", "Pstatus"),
    p = TRUE,
    add_dependent_label = TRUE
  )
## Warning in chisq.test(traveltime, high_use): Chi-squared approximation may be
## incorrect
##  Dependent: High alcohol consumption                FALSE      TRUE      p
##             Free time (after school)  very low   15 (5.8)   2 (1.8)  0.032
##                                            low  46 (17.8) 14 (12.6)       
##                                         medium 112 (43.2) 40 (36.0)       
##                                           high  65 (25.1) 40 (36.0)       
##                                      very high   21 (8.1) 15 (13.5)       
##                                  Sex    Female 154 (59.5) 41 (36.9) <0.001
##                                           Male 105 (40.5) 70 (63.1)       
##         Travel time (home to school)   <15 min 174 (67.2) 68 (61.3)  0.003
##                                      15–30 min  73 (28.2) 26 (23.4)       
##                                      30–60 min   10 (3.9)  11 (9.9)       
##                                        >60 min    2 (0.8)   6 (5.4)       
##         Parent's cohabitation status     Apart  26 (10.0) 12 (10.8)  0.970
##                                       Together 233 (90.0) 99 (89.2)

Using summary_factorlist() from the finalfit package reveals the same.

factor_student_alc %>%
  finalfit("high_use",
           c("sex", "Pstatus", "traveltime", "freetime"),
           metrics = TRUE)
## [[1]]
##  Dependent: High alcohol consumption                FALSE      TRUE
##                                  Sex    Female 154 (79.0) 41 (21.0)
##                                           Male 105 (60.0) 70 (40.0)
##         Parent's cohabitation status     Apart  26 (68.4) 12 (31.6)
##                                       Together 233 (70.2) 99 (29.8)
##         Travel time (home to school)   <15 min 174 (71.9) 68 (28.1)
##                                      15–30 min  73 (73.7) 26 (26.3)
##                                      30–60 min  10 (47.6) 11 (52.4)
##                                        >60 min   2 (25.0)  6 (75.0)
##             Free time (after school)  very low  15 (88.2)  2 (11.8)
##                                            low  46 (76.7) 14 (23.3)
##                                         medium 112 (73.7) 40 (26.3)
##                                           high  65 (61.9) 40 (38.1)
##                                      very high  21 (58.3) 15 (41.7)
##                OR (univariable)             OR (multivariable)
##                               -                              -
##    0.19 (0.10 to 0.28, p<0.001)   0.15 (0.06 to 0.25, p=0.002)
##                               -                              -
##  -0.02 (-0.17 to 0.14, p=0.823) -0.06 (-0.21 to 0.09, p=0.464)
##                               -                              -
##  -0.02 (-0.12 to 0.09, p=0.734) -0.02 (-0.12 to 0.09, p=0.724)
##    0.24 (0.04 to 0.45, p=0.019)   0.25 (0.05 to 0.45, p=0.015)
##    0.47 (0.15 to 0.79, p=0.004)   0.42 (0.10 to 0.73, p=0.010)
##                               -                              -
##   0.12 (-0.13 to 0.36, p=0.355)  0.08 (-0.16 to 0.32, p=0.525)
##   0.15 (-0.08 to 0.37, p=0.212)  0.11 (-0.11 to 0.33, p=0.334)
##    0.26 (0.03 to 0.50, p=0.027)  0.22 (-0.01 to 0.45, p=0.066)
##    0.30 (0.04 to 0.56, p=0.026)  0.21 (-0.06 to 0.47, p=0.124)
## 
## [[2]]
##                                                                                                                                                   
##  Number in dataframe = 370, Number in model = 370, Missing = 0, Log-likelihood = -218.3, AIC = 458.6, R-squared = 0.093, Adjusted R-squared = 0.07
factor_student_alc %>%
  or_plot(
    "high_use",
    c("sex", "Pstatus", "traveltime", "freetime"),
    title_text_size = 13,
    table_text_size = 4,
    column_space = c(-3, 3.2, 7.5),
    breaks = c(0.5, 1, 2.5, 5, 10, 25, 50)
  )

Task 6.

Based on the previous model results, I will exclude Pstatus from further analyses.

model <- glm(high_use ~ sex + freetime + traveltime - 1, data = factor_student_alc, family = "binomial")
probabilities <- predict(model, type = "response")
factor_student_alc <- factor_student_alc %>%
  mutate(probability = probabilities,
         prediction = probability > 0.5) # Second mutate() needs no data argument; the piped data frame is reused.
table(high_use = factor_student_alc$high_use, prediction = factor_student_alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   252    7
##    TRUE     97   14
table(high_use = factor_student_alc$high_use,
      prediction = factor_student_alc$prediction) %>% prop.table() %>% addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.68108108 0.01891892 0.70000000
##    TRUE  0.26216216 0.03783784 0.30000000
##    Sum   0.94324324 0.05675676 1.00000000
ggplot(factor_student_alc,
       aes(x = probability, y = high_use, col = prediction)) +
  geom_point() +
  theme_bw()

loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

loss_func(class = factor_student_alc$high_use, prob = factor_student_alc$probability)
## [1] 0.2810811

According to the loss function, my final model misclassifies 28.1% of the observations on average (note that this is the training error, not a cross-validated one), which probably isn’t that great.
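A genuine out-of-sample error estimate would require cross-validation; a sketch with boot::cv.glm and 10 folds, assuming `model`, `loss_func`, and `factor_student_alc` from the chunks above:

```r
library(boot)

# 10-fold cross-validation; delta[1] is the raw CV estimate of the
# average prediction error under loss_func.
set.seed(2135)
cv <- cv.glm(data = factor_student_alc, glmfit = model,
             cost = loss_func, K = 10)
cv$delta[1]
```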


Chapter 4: Clustering and classification

date()
## [1] "Tue Dec 13 04:22:42 2022"

Analysis exercises

Task 2.

The dataset used in this exercise, titled “Boston”, contains 14 variables (columns) and 506 observations (rows). It holds information on housing values, collected to evaluate people’s willingness to pay for cleaner air in the Boston metropolitan area in the 1970s (source).

library(MASS)
data("Boston")

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

Task 3.

Looking at the plots and the summary table, most variables do not appear normally distributed; crim, zn, and rad, for example, are heavily skewed. Many of the variables also correlate strongly with each other (both positively and negatively) based on the plots. The weakest-correlating variable seems to be chas, i.e. the Charles River dummy variable, probably because it is binary (0 or 1). Interestingly, indus, i.e. the proportion of non-retail business acres per town, and tax, i.e. the full-value property-tax rate per $10,000, are bimodally distributed.
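The weak correlations of chas can also be read off numerically; a quick sketch, assuming the Boston data from above is loaded:

```r
# Correlations between the binary chas dummy and all other variables,
# sorted for easier reading (chas correlates 1 with itself).
sort(cor(Boston)[, "chas"])
```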

library(GGally)

ggpairs(Boston, lower = list(
  continuous = wrap(
    "smooth",
    alpha = 0.3,
    size = 0.5,
    color = "firebrick1"
  )
)) + theme_bw()

library(tidyverse)
library(corrplot)

cor_matrix <- cor(Boston) %>% round(digits = 2)

corrplot(
  cor_matrix,
  method = "circle",
  type = "lower",
  cl.pos = "b",
  tl.pos = "d",
  tl.cex = 0.6
)

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Task 4.

The original Boston dataset has variables measured in different ways and on different scales/magnitudes. After applying scale(), every variable has been standardized to mean 0 and standard deviation 1, which allows for better comparison between them.
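What scale() does can be verified on a toy vector of my own: it subtracts the column mean and divides by the standard deviation.

```r
# scale() standardizes: (x - mean(x)) / sd(x), giving mean 0 and sd 1.
x <- c(2, 4, 6, 8)
isTRUE(all.equal(as.vector(scale(x)), (x - mean(x)) / sd(x)))  # TRUE
```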

boston_scaled <- Boston %>% scale()

summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
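As a quick check of what scale() does, the column-wise z-score can be reproduced by hand; a minimal sketch (assuming the MASS package, which provides Boston, is installed):

```r
library(MASS) # Provides the Boston data set.

# Manual z-score of one column vs. the scale() result.
manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
scaled <- scale(Boston)[, "crim"]

# The two standardizations agree up to floating-point error.
all.equal(unname(manual), unname(scaled))
```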
boston_scaled_df <- as.data.frame(boston_scaled)
bins <- quantile(boston_scaled_df$crim)

crime <-
  cut(
    boston_scaled_df$crim,
    breaks = bins,
    include.lowest = TRUE
  )

boston_scaled_df <- boston_scaled_df %>% dplyr::select(-crim)
boston_scaled_df <- data.frame(boston_scaled_df, crime)

levels(boston_scaled_df$crime) <- c("low", "med_low", "med_high", "high")

n <- nrow(boston_scaled_df)

set.seed(2126)
ind <- sample(n, size = n * 0.8)

train <- boston_scaled_df[ind,]
test <- boston_scaled_df[-ind,]
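As a sanity check on the 80/20 split: sample() silently truncates the non-integer size 506 * 0.8 = 404.8 down to 404, leaving the 102-row test set used in Task 6.

```r
set.seed(2126)
n <- 506                      # Number of rows in the Boston data.
ind <- sample(n, size = n * 0.8)

length(ind)                   # 404 observations in the training set
n - length(ind)               # 102 observations in the test set
```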

Task 5.

lda.fit <- lda(crime ~ ., data = train)

lda.arrows <-
  function(x,
           myscale = 1,
           arrow_heads = 0.1,
           color = "black",
           tex = 0.75,
           choices = c(1, 2)) {
    heads <- coef(x)
    arrows(
      x0 = 0,
      y0 = 0,
      x1 = myscale * heads[, choices[1]],
      y1 = myscale * heads[, choices[2]],
      col = color,
      length = arrow_heads
    )
    text(
      myscale * heads[, choices],
      labels = row.names(heads),
      cex = tex,
      col = color,
      pos = 3
    )
  }

classes <- as.numeric(train$crime)

plot(lda.fit,
     dimen = 2,
     col = c("red", "blue", "purple", "gold")[classes])
lda.arrows(lda.fit, myscale = 2)

Task 6.

The table shows that the majority of the model's predictions are correct overall (74 out of 102, i.e. 72.5%). Looking at the table in more detail, the model correctly predicted 16 out of 22 (72.7%) low crime rates, 16 out of 29 (55.2%) medium low crime rates, 12 out of 21 (57.1%) medium high crime rates, and all 30 (100%) high crime rates. The model is thus highly accurate in predicting high crime rates but somewhat inaccurate with the lower classes, especially the medium low one.

correct_classes <- test$crime
new_test <- test %>% dplyr::select(-crime)

lda.pred <- predict(lda.fit, newdata = new_test)
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       16       5        1    0
##   med_low    8      16        5    0
##   med_high   1       7       12    1
##   high       0       0        0   30
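The percentages above can be computed directly from the confusion matrix; a small sketch with the counts copied from the table:

```r
# Confusion matrix from the table above (rows = correct, cols = predicted).
conf_mat <- matrix(
  c(16,  5,  1,  0,
     8, 16,  5,  0,
     1,  7, 12,  1,
     0,  0,  0, 30),
  nrow = 4, byrow = TRUE,
  dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                  predicted = c("low", "med_low", "med_high", "high"))
)

# Overall accuracy: 74 / 102, about 0.725.
sum(diag(conf_mat)) / sum(conf_mat)

# Per-class accuracy (diagonal over row totals).
round(diag(conf_mat) / rowSums(conf_mat), 3)
```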

Task 7.

As it was not specified which distances we were supposed to compute, I computed both Euclidean and Manhattan distances. For the initial k-means analysis, I arbitrarily picked three clusters. After determining the optimal number of clusters from the total within-cluster sum of squares (WCSS), I re-ran the analysis with two clusters, since the line plot shows the steepest drop between one and two clusters. The scaling of the Boston dataset seems to do weird things to the distribution of the rad variable; I'm not sure why. Anyway, looking at the new distribution with two k-means clusters, the indus, nox, and tax variables have two differently colored peaks. This makes sense, as I previously noted that these variables are bimodally distributed, so I would assume the scaled data has two centroids.

data("Boston")

new_boston_scaled <- as.data.frame(scale(Boston))
dist_eu <- dist(new_boston_scaled)
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
dist_man <- dist(new_boston_scaled, method = "manhattan")
summary(dist_man)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618
library(GGally) # Needed for ggpairs() below.
library(RColorBrewer)

k_means <- kmeans(new_boston_scaled, centers = 3)

ggpairs(
  as.data.frame(new_boston_scaled),
  aes(color = as.factor(k_means$cluster)),
  lower = list(continuous = wrap(
    "smooth",
    alpha = 0.3,
    size = 0.5
  )),
  upper = list(continuous = wrap("cor", size = 2))
) +
  scale_color_brewer(palette = "Dark2") +
  scale_fill_brewer(palette = "Dark2") +
  theme_bw()

twcss <- sapply(1:10, function(k){kmeans(new_boston_scaled, k)$tot.withinss})

qplot(x = 1:10, y = twcss, geom = "line") +
  scale_x_continuous(breaks = c(1:10)) +
  labs(x = "clusters") +
  theme_bw()
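The "steepest drop" can be made concrete by looking at successive differences of the WCSS curve; a sketch recomputing it from the scaled Boston data (nstart = 10 is my addition for stability; the chunk above used single random starts):

```r
library(MASS) # Provides the Boston data set.

set.seed(2126)
scaled_boston <- scale(Boston)
twcss <- sapply(1:10, function(k) {
  kmeans(scaled_boston, centers = k, nstart = 10)$tot.withinss
})

# Drop in WCSS when moving from k to k + 1 clusters; the largest drop
# should occur between one and two clusters.
drops <- -diff(twcss)
which.max(drops)
```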

new_k_means <- kmeans(new_boston_scaled, centers = 2)

ggpairs(
  new_boston_scaled,
  aes(color = as.factor(new_k_means$cluster)),
  lower = list(continuous = wrap(
    "smooth",
    alpha = 0.3,
    size = 0.5
  )),
  upper = list(continuous = wrap("cor", size = 2))
) +
  scale_color_brewer(palette = "Dark2") +
  scale_fill_brewer(palette = "Dark2") +
  theme_bw()


Chapter 5: Dimensionality reduction techniques

date()
## [1] "Tue Dec 13 04:24:09 2022"

Analysis exercises

Overview of the data

At a glance, each variable follows a (positively or negatively) skewed distribution; only Edu.Exp, i.e., expected years of education, is nearly symmetrically distributed. Certain variables, such as gross national income (GNI), maternal mortality ratio (Mat.Mor), and adolescent birth rate (Ado.Birth), have a wide range between their minimum and maximum values, with means far from their medians, which suggests uneven and highly skewed distributions. Curiously, the highest GNI (123 124), quite an outlier, belongs to Qatar. I wonder whether this has anything to do with its foreign aid/investments, oil and gas exports, and the FIFA World Cup it is currently hosting.
   The highest positive and statistically significant correlation is between expected years of education and life expectancy at birth (Life.Exp; 0.789); conversely, the lowest is between maternal mortality ratio (Mat.Mor) and life expectancy at birth (−0.857). The former makes sense: with a high life expectancy, you most likely live in a country where you do not have to struggle to survive, so you can spend more time getting a degree. The latter is also sensible: if a mother dies in childbirth, the child's chances of survival decrease, especially in developing countries.

library(tidyverse)
library(GGally)

human_data <-
  read.csv("Data/new_human.csv",
           header = T,
           row.names = "X")

custom_scatter_plots <-
  function(data, mapping, method = "lm", ...) {
    ggplot(data = data, mapping = mapping) +
      geom_point(colour = "dodgerblue",
                 alpha = 0.3,
                 size = 1) +
      geom_smooth(method = method,
                  color = "blueviolet",
                  linewidth = 0.5,
                  ...)
  }

ggpairs(
  human_data,
  lower = list(continuous = wrap(custom_scatter_plots)),
  upper = list(continuous = wrap("cor", size = 4, color = "black"))
) +
  theme_bw() +
  theme(
    strip.text = element_text(size = 11, color = "black", face = "bold"),
    axis.text.x = element_text(
      angle = 45,
      hjust = 1,
      vjust = 1
    )
  )

summary(human_data)
##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50

Principal component analysis (PCA)

The results for standardized and non-standardized human data are quite different. The loadings (indicated with red arrows) in the PCA of the non-standardized data reveal that the scales of the observations differ greatly between variables. Looking at the data, gross national income (GNI) and maternal mortality ratio (Mat.Mor) are measured in the thousands, while the other variables are either ratios or measured at most in the hundreds. As a result, these two contribute the most to the analysis and create the boomerang-shaped distribution in the non-standardized plot. The PCA dimensions also look odd, as dimension 1 explains 99.99% of the variance and dimension 2 only 0.01%. Standardizing amends this: the variables become comparable on the same scale, which can be seen, e.g., in the more reasonable loadings (dimension 1 explains 53.6% and dimension 2 16.24% of the variance) and in the more circular distribution of data points.
   The clustering of countries in the PCA of standardized human data is interesting. Western nations (in Europe and North America) as well as Australia and New Zealand cluster towards female-to-male labor force participation rate (Labo.FM), female percent representation in parliament (Parli.F), expected years of education (Edu.Exp), GNI, female-to-male ratio with secondary education (Edu2.FM), and life expectancy at birth (Life.Exp). This is reasonable, as these nations are known for stability, safety, and upholding human rights.
   Curiously, many of the Caribbean and Latin American countries cluster only towards Parli.F and Labo.FM. This suggests that these countries have many females in parliamentary positions and in the work force, but are neither the wealthiest (based on the GNI loading) nor the safest (based on, e.g., Mat.Mor or Life.Exp). The high values in Parli.F and Labo.FM might be due to poor families sending all available/capable family members to work out of necessity.
   The Sub-Saharan, about half of the East Asian, and most of the South Asian countries cluster mainly due to Mat.Mor and Ado.Birth. Considering that these are mostly poor developing countries with a less-than-great human rights record, it is not surprising that females in these regions give birth young and die more often in childbirth, most likely because of the former.
   Lastly, most of the countries in the Middle East and North Africa cluster towards Edu.Exp, GNI, Edu2.FM, and Life.Exp. High GNI values are sensible due to, for example, oil and gas exports, whilst for the other variables the reasons are unclear (at least to me at the moment of writing).

library(ggfortify)
library(ggthemes)
library(countrycode)
library(patchwork)

pca_human <- prcomp(human_data)
new_human_data <-
  human_data %>% mutate(Region = countryname(row.names(human_data), destination = "region"))

pca1 <- autoplot(
  pca_human,
  data = new_human_data,
  colour = "Region",
  label = TRUE,
  label.size = 4,
  label.repel = TRUE,
  loadings = TRUE,
  loadings.label = TRUE,
  loadings.label.size = 4,
  loadings.label.repel = TRUE
) +
  scale_color_manual(values = ptol_pal()(length(unique(
    new_human_data$Region
  ))),
  guide = guide_legend(override.aes = list(size = 3))) +
  geom_hline(yintercept = 0,
             linetype = "dashed",
             alpha = 0.4) +
  geom_vline(xintercept = 0,
             linetype = "dashed",
             alpha = 0.4) +
  ggtitle("Non-standardized human data") +
  theme_bw()

pca_human_scaled <- prcomp(scale(human_data))

pca2 <- autoplot(
  pca_human_scaled,
  data = new_human_data,
  colour = "Region",
  label = TRUE,
  label.size = 4,
  label.repel = TRUE,
  loadings = TRUE,
  loadings.label = TRUE,
  loadings.label.size = 4,
  loadings.label.repel = TRUE
) +
  scale_color_manual(values = ptol_pal()(length(unique(
    new_human_data$Region
  ))),
  guide = guide_legend(override.aes = list(size = 3))) +
  geom_hline(yintercept = 0,
             linetype = "dashed",
             alpha = 0.4) +
  geom_vline(xintercept = 0,
             linetype = "dashed",
             alpha = 0.4) +
  ggtitle("Standardized human data") +
  theme_bw()

pca1 + pca2 + plot_layout(guides = "collect") &
  theme(legend.text = element_text(color = "black", size = 10))
**Principal component analysis (PCA) of standardized and non-standardized human data.** **In the left panel**, gross national income (GNI) and maternal mortality ratio (Mat.Mor) pull the data in two directions: affluent oil-exporting countries, such as Qatar, the United Arab Emirates, and the United States, and banking countries, such as Luxembourg and Switzerland, cluster towards the former; whilst many of the Sub-Saharan African countries are pulled towards the latter. Dimension 1 explains 99.99% of the variance, and dimension 2 explains 0.01%. **In the right panel**, there are overlapping region clusters pulled in several directions based on loadings: countries in Sub-Saharan Africa cluster towards Mat.Mor, adolescent birth rate (Ado.Birth), female percent representation in parliament (Parli.F), and female-to-male labour force participation rate (Labo.FM); several Latin American and Caribbean countries towards Labo.FM and Parli.F; European, North American, and Pacific countries towards Labo.FM, Parli.F, expected years of education (Edu.Exp), GNI, female-to-male ratio with secondary education (Edu2.FM), and life expectancy at birth (Life.Exp); and lastly, countries in South, East and Central Asia, the Middle East, and North Africa towards Edu.Exp, GNI, Edu2.FM, and Life.Exp. Dimensions 1 and 2 explain 53.6% and 16.24% of the variance, respectively.

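The contrast between the two panels is a generic property of PCA on unstandardized data: whichever variable has the largest raw variance dominates the first component. A minimal toy sketch (synthetic data, not the human data set):

```r
set.seed(1)
# Two independent variables; b lives on a scale 1000x larger than a.
d <- data.frame(a = rnorm(100), b = 1000 * rnorm(100))

var_share <- function(p) round(p$sdev^2 / sum(p$sdev^2), 4)

var_share(prcomp(d))         # PC1 absorbs essentially all the variance
var_share(prcomp(scale(d)))  # shares even out after standardization
```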

Multiple Correspondence Analysis (MCA)

I decided to run the MCA on six randomly chosen variables: escape.exoticism, where, dinner, Tea, resto, and pub. Looking at the eigenvalues in the summary, Dim.1 and Dim.2 explain 19.761% and 14.687% of the variance, respectively. In the Categories section, the ctr column for Dim.1 shows five categories with the highest contributions (ctr > 12.0): chain store+tea shop, tea shop, green, resto, and pub.
   The plots show associations between categories. For example, dinner and tea shop cluster near each other, suggesting a strong association, which would make sense, as people probably drink tea with dinner at tea shops.

# Load tea data.
tea_data <-
  read.csv(
    "https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv",
    stringsAsFactors = TRUE
  )

# Show data structure.
str(tea_data)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
##  $ frequency       : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
# Plot data.
tea_data %>%
  dplyr::select(-age) %>% # Removed age due to age_Q already existing as factor.
  pivot_longer(cols = everything()) %>%
  ggplot(aes(value, fill = name)) +
  geom_bar() +
  guides(fill = "none") +
  theme_bw() +
  theme(axis.text.x = element_text(
    angle = 45,
    hjust = 1,
    size = 8
  )) +
  facet_wrap("name", scales = "free")

library(FactoMineR)

# Pick randomly six variables.
set.seed(2133)
interesting_vars <- sample(colnames(tea_data), 6)
new_tea_data <- tea_data %>% dplyr::select(all_of(interesting_vars))

# Run MCA.
mca_tea_data <- MCA(new_tea_data, graph = FALSE)
# summary(mca_tea_data, nbelements = Inf) # nbelements = Inf prints whole summary with its 300 individuals.
summary(mca_tea_data)
## 
## Call:
## MCA(X = new_tea_data, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.263   0.196   0.182   0.164   0.153   0.136   0.123
## % of var.             19.761  14.687  13.629  12.290  11.484  10.230   9.258
## Cumulative % of var.  19.761  34.448  48.076  60.366  71.850  82.080  91.338
##                        Dim.8
## Variance               0.115
## % of var.              8.662
## Cumulative % of var. 100.000
## 
## Individuals (the 10 first)
##                         Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                    |  0.260  0.085  0.078 | -0.298  0.151  0.102 |  0.709
## 2                    |  0.133  0.022  0.020 | -0.609  0.632  0.411 |  0.247
## 3                    |  0.123  0.019  0.005 |  0.670  0.764  0.147 | -0.241
## 4                    |  0.454  0.260  0.076 |  0.166  0.047  0.010 | -0.954
## 5                    |  0.059  0.004  0.007 | -0.524  0.468  0.563 | -0.425
## 6                    |  0.580  0.425  0.127 |  0.477  0.388  0.086 | -0.493
## 7                    |  0.185  0.043  0.076 | -0.213  0.077  0.100 |  0.036
## 8                    |  0.260  0.085  0.078 | -0.298  0.151  0.102 |  0.709
## 9                    | -0.133  0.022  0.021 |  0.153  0.040  0.028 | -0.030
## 10                   | -0.058  0.004  0.003 |  0.067  0.008  0.004 |  0.642
##                         ctr   cos2  
## 1                     0.922  0.578 |
## 2                     0.112  0.068 |
## 3                     0.107  0.019 |
## 4                     1.671  0.339 |
## 5                     0.332  0.371 |
## 6                     0.445  0.091 |
## 7                     0.002  0.003 |
## 8                     0.922  0.578 |
## 9                     0.002  0.001 |
## 10                    0.756  0.330 |
## 
## Categories (the 10 first)
##                          Dim.1     ctr    cos2  v.test     Dim.2     ctr
## escape-exoticism     |  -0.205   1.258   0.038  -3.360 |  -0.436   7.646
## Not.escape-exoticism |   0.184   1.131   0.038   3.360 |   0.392   6.872
## chain store          |   0.124   0.623   0.027   2.860 |  -0.464  11.748
## chain store+tea shop |  -0.855  12.027   0.257  -8.765 |   0.505   5.649
## tea shop             |   1.429  12.924   0.227   8.239 |   1.658  23.410
## dinner               |   1.130   5.656   0.096   5.362 |   1.704  17.301
## Not.dinner           |  -0.085   0.426   0.096  -5.362 |  -0.128   1.302
## black                |  -0.055   0.048   0.001  -0.546 |  -0.208   0.905
## Earl Grey            |  -0.284   3.278   0.145  -6.592 |   0.019   0.019
## green                |   1.784  22.141   0.393  10.844 |   0.356   1.189
##                         cos2  v.test     Dim.3     ctr    cos2  v.test  
## escape-exoticism       0.171  -7.142 |  -0.622  16.795   0.348 -10.196 |
## Not.escape-exoticism   0.171   7.142 |   0.559  15.094   0.348  10.196 |
## chain store            0.383 -10.707 |   0.054   0.170   0.005   1.242 |
## chain store+tea shop   0.090   5.179 |  -0.117   0.325   0.005  -1.197 |
## tea shop               0.306   9.559 |  -0.041   0.015   0.000  -0.237 |
## dinner                 0.219   8.084 |  -1.258  10.166   0.119  -5.970 |
## Not.dinner             0.219  -8.084 |   0.095   0.765   0.119   5.970 |
## black                  0.014  -2.055 |   1.228  34.109   0.494  12.149 |
## Earl Grey              0.001   0.434 |  -0.492  14.290   0.437 -11.429 |
## green                  0.016   2.167 |   0.125   0.157   0.002   0.758 |
## 
## Categorical variables (eta2)
##                        Dim.1 Dim.2 Dim.3  
## escape.exoticism     | 0.038 0.171 0.348 |
## where                | 0.404 0.479 0.006 |
## dinner               | 0.096 0.219 0.119 |
## Tea                  | 0.403 0.025 0.529 |
## resto                | 0.384 0.051 0.080 |
## pub                  | 0.256 0.231 0.008 |
par(mfrow = c(1, 2)) # Plot figures next to each other.

plot.MCA(
  mca_tea_data,
  invisible = "ind",
  graph.type = "classic",
  habillage = "quali",
  title = "MCA factor map"
)

plot.MCA(
  mca_tea_data,
  invisible = "ind",
  graph.type = "classic",
  habillage = "quali",
  selectMod = "contrib 5",
  # Show the five categories that contribute the most (ctr > 12.0).
  title = "MCA factor map with most contributing categories"
)
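Under the hood, MCA operates on the complete disjunctive table, i.e. one 0/1 indicator column per category level. A base-R sketch on toy data (hypothetical factors, not the tea survey):

```r
set.seed(42)
toy <- data.frame(
  drink = factor(sample(c("tea", "coffee"), 6, replace = TRUE),
                 levels = c("tea", "coffee")),
  place = factor(sample(c("home", "work", "pub"), 6, replace = TRUE),
                 levels = c("home", "work", "pub"))
)

# model.matrix() with full (non-contrast) coding yields one column per level.
Z <- model.matrix(
  ~ drink + place - 1,
  data = toy,
  contrasts.arg = lapply(toy, contrasts, contrasts = FALSE)
)

ncol(Z)     # 2 + 3 = 5 indicator columns
rowSums(Z)  # each row scores exactly one level per variable, so all 2s
```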


Chapter 6: Analysis of longitudinal data

date()
## [1] "Tue Dec 13 04:24:31 2022"

Analysis exercises

Analyzing RATS data

The prepared RATS data, from Martin J. Crowder and David J. Hand, records the body weights of rats measured roughly once a week over a 64-day period. It consists of five variables with 176 observations each: ID (n = 16), Group (n = 3), WD (the original weight-measurement day label, which I left in the data just in case), Weight_g (weight in grams), and Time (measurement day).

# Import prepared RATS data.
rats <- read.csv("Data/rats.csv")

# Check data properties.
str(rats)
## 'data.frame':    176 obs. of  5 variables:
##  $ ID      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Group   : int  1 1 1 1 1 1 1 1 2 2 ...
##  $ WD      : chr  "WD1" "WD1" "WD1" "WD1" ...
##  $ Weight_g: int  240 225 245 260 255 260 275 245 410 405 ...
##  $ Time    : int  1 1 1 1 1 1 1 1 1 1 ...

The ID and Group variables should be treated as labels (factors) instead of values (integers), so they will be converted. Curiously, the summary table shows that Group 1 has twice the number of observations (n = 88) compared to the other two groups (n = 44 each).

# Convert ID and Group from integers (numbers) to factors.
library(tidyverse)

rats <-
  rats %>% mutate(ID = as.factor(ID), Group = as.factor(Group))

# Check summary.
summary(rats)
##        ID      Group       WD               Weight_g          Time      
##  1      : 11   1:88   Length:176         Min.   :225.0   Min.   : 1.00  
##  2      : 11   2:44   Class :character   1st Qu.:267.0   1st Qu.:15.00  
##  3      : 11   3:44   Mode  :character   Median :344.5   Median :36.00  
##  4      : 11                             Mean   :384.5   Mean   :33.55  
##  5      : 11                             3rd Qu.:511.2   3rd Qu.:50.00  
##  6      : 11                             Max.   :628.0   Max.   :64.00  
##  (Other):110

Checking the distribution of the weights with a density plot and a Q-Q plot reveals that it is not normally distributed and that there are two distinct groups of rats (one below ca. 300 grams and the other above ca. 400 grams).

library(ggpubr)
library(patchwork)

density_plot <- ggdensity(rats$Weight_g)
qqplot <- ggqqplot(rats$Weight_g)

density_plot + qqplot
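The visual impression of non-normality can also be checked formally with base R's shapiro.test(); a sketch on synthetic bimodal weights (toy values mimicking the two weight clusters, not the RATS data):

```r
set.seed(1)
# Two clusters of weights, roughly like the light and heavy rat groups.
w <- c(rnorm(50, mean = 270, sd = 15), rnorm(50, mean = 500, sd = 40))

shapiro.test(w)$p.value  # far below 0.05, so normality is rejected
```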

Plotting the data both as is (raw, panel a) and scaled (standardized, panel b) with line plots confirms this. We can also see that Group 1 has the lightest rats, while Groups 2 and 3 are considerably heavier; Group 2 also contains one clear outlier, the heaviest rat in the data.

library(ggthemes)
library(gridExtra)

# Plot data as is.
line_plot1 <- ggplot(rats, aes(
  x = Time,
  y = Weight_g,
  group = ID,
  color = Group
)) +
  geom_line(linewidth = 1) +
  labs(y = "Weight (grams)") +
  scale_color_manual(values = ptol_pal()(length(unique(rats$Group)))) +
  theme_bw() +
  theme(
    axis.title = element_text(face = "bold"),
    axis.text = element_text(size = 10),
    axis.title.x = element_blank(),
    legend.title = element_text(face = "bold"),
    legend.text = element_text(size = 10)
  )

# Standardize data and plot again.
rats_scaled <- rats %>% group_by(Time) %>%
  mutate(Weight_scaled = scale(Weight_g)) %>%
  ungroup()

line_plot2 <- ggplot(rats_scaled,
                     aes(
                       x = Time,
                       y = Weight_scaled,
                       group = ID,
                       color = Group
                     )) +
  geom_line(linewidth = 1) +
  labs(y = "Weight in grams (Standardized)") +
  scale_color_manual(values = ptol_pal()(length(unique(rats$Group)))) +
  theme_bw() +
  theme(
    axis.title = element_text(face = "bold"),
    axis.text = element_text(size = 10),
    axis.title.x = element_blank(),
    legend.title = element_text(face = "bold"),
    legend.text = element_text(size = 10)
  )

# Join plots.
joined_plot <- line_plot1 + line_plot2 +
  plot_layout(guides = "collect") +
  plot_annotation(tag_levels = "a", tag_suffix = ")") &
  theme(
    legend.text = element_text(color = "black", size = 10),
    plot.tag = element_text(color = "black", face = "bold")
  )

grid.arrange(patchworkGrob(joined_plot),
             bottom = text_grob("Day", face = "bold"))

Next, the data are summarized with a plot of group means and standard errors. The relationship between the groups stays the same even with Group 2's outlier included (panel a), although the outlier inflates that group's mean and widens its standard error. Removing the outlier lowers the mean weight and reduces the variation (panel b).

# Calculate mean weights and standard errors.
rats_mean <- rats %>%
  group_by(Group, Time) %>%
  summarise(Weight_mean = mean(Weight_g),
            Weight_se = sd(Weight_g) / sqrt(length((Weight_g)))) %>%
  ungroup()

# Plot mean of raw data.
summary_plot1 <- ggplot(rats_mean,
                        aes(x = Time,
                            y = Weight_mean,
                            color = Group)) +
  geom_line(linewidth = 1) +
  geom_point(size = 3) +
  geom_errorbar(
    aes(ymin = Weight_mean - Weight_se,
        ymax = Weight_mean + Weight_se),
    width = 1.3,
    linewidth = 0.7
  ) +
  scale_color_manual(values = ptol_pal()(length(unique(rats_mean$Group)))) +
  labs(y = "Weight in grams (Mean +/- StdErr)") +
  theme_bw() +
  theme(
    axis.title = element_text(face = "bold"),
    axis.text = element_text(size = 10),
    axis.title.x = element_blank(),
    legend.title = element_text(face = "bold"),
    legend.text = element_text(size = 10),
  )

# Remove outlier i.e., ID 12, and recalculate values.
rats_mean_filtered <- rats %>% filter(ID != "12") %>%
  group_by(Group, Time) %>%
  summarise(Weight_mean = mean(Weight_g),
            Weight_se = sd(Weight_g) / sqrt(length((Weight_g)))) %>%
  ungroup()

# Plot filtered data.
summary_plot2 <- ggplot(rats_mean_filtered,
       aes(x = Time,
           y = Weight_mean,
           color = Group)) +
  geom_line(linewidth = 1) +
  geom_point(size = 3) +
  geom_errorbar(
    aes(ymin = Weight_mean - Weight_se,
        ymax = Weight_mean + Weight_se),
    width = 1.3,
    linewidth = 0.7
  ) +
  scale_color_manual(values = ptol_pal()(length(unique(rats_mean$Group)))) +
  theme_bw() +
  theme(
    axis.title = element_blank(),
    legend.title = element_text(face = "bold"),
    legend.text = element_text(size = 10),
  )

summary_plots <- summary_plot1 + summary_plot2 +
  plot_layout(guides = "collect") +
  plot_annotation(tag_levels = "a", tag_suffix = ")") &
  theme(
    legend.text = element_text(color = "black", size = 10),
    plot.tag = element_text(color = "black", face = "bold")
  )

grid.arrange(patchworkGrob(summary_plots),
             bottom = text_grob("Day", face = "bold"))

A box plot of the group mean profiles, conversely, shows no outliers. Before plotting, I removed the baseline measurements (day 1) in preparation for the t-test and ANOVA later. While this would suggest it is okay to run the tests on the data as is, I will use the data without the outlier rat (ID 12).

# Filter out baseline weight i.e., weight on day 1.
new_rats_mean <- rats_mean %>% filter(Time > 1)

# Make box plot.
ggplot(new_rats_mean,
       aes(x = Group,
           y = Weight_mean, fill = Group)) +
  geom_boxplot(color = "black") +
  stat_summary(
    fun = "mean",
    geom = "point",
    shape = 23,
    size = 3
  ) +
  scale_fill_manual(values = ptol_pal()(length(unique(new_rats_mean$Group))), guide = "none") +
  labs(x = "Group", y = "Weight in grams (mean)") +
  theme_bw() +
  theme(
    axis.title = element_text(face = "bold"),
    legend.title = element_text(face = "bold"),
    legend.text = element_text(size = 10),
    axis.text = element_text(size = 10)
  )

The t-test, run to see whether the mean weights differ between the rat groups, produced the table below; with three groups, rstatix::t_test carries out all pairwise comparisons, and the non-integer degrees of freedom indicate Welch's unequal-variance test. Given the p-values, we can safely reject the null hypothesis that there is no difference in mean weights between the groups. The plots already show as much, but it is nice to have a quantitative/statistical confirmation.

library(rstatix)
library(knitr)

# Remove baseline (day 1) measurements.
new_rats_mean_filtered <- rats_mean_filtered %>% filter(Time > 1)

# Run t-test and output results to a table.
kable(new_rats_mean_filtered %>% t_test(Weight_mean ~ Group))
| .y.         | group1 | group2 | n1 | n2 | statistic |       df |  p | p.adj | p.adj.signif |
|:------------|:-------|:-------|---:|---:|----------:|---------:|---:|------:|:-------------|
| Weight_mean | 1      | 2      | 10 | 10 | -32.70072 | 11.54387 |  0 |     0 | ****         |
| Weight_mean | 1      | 3      | 10 | 10 | -55.64033 | 12.90932 |  0 |     0 | ****         |
| Weight_mean | 2      | 3      | 10 | 10 | -10.97700 | 17.12515 |  0 |     0 | ****         |

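As a cross-check on the rstatix table above, base R's `pairwise.t.test()` runs the same set of pairwise comparisons with adjusted p-values; a minimal sketch on simulated weights (the group means below are made up for illustration, not the actual RATS data):

```r
# Sketch: pairwise Welch t-tests with Holm adjustment in base R.
# The weights are simulated stand-ins for the three rat groups.
set.seed(1)
weight <- c(rnorm(10, 265, 10), rnorm(10, 450, 10), rnorm(10, 525, 10))
group  <- factor(rep(1:3, each = 10))
res <- pairwise.t.test(weight, group,
                       p.adjust.method = "holm",
                       pool.sd = FALSE)  # pool.sd = FALSE -> Welch's test
res$p.value  # matrix of adjusted p-values, one per group pair
```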
To run the t-test and ANOVA, I created a new data frame with mean weights per ID. Based on the previous plots and analyses, I filtered out the baseline measurements and the outlier (ID 12). The fitted linear model of the rats data shows statistically significant positive associations for the included variables, and it looks really good (maybe too good): the residuals seem evenly distributed; the p-values for the coefficients are all statistically significant; and the R² values are above 0.99, meaning that over 99% of the variance in mean weights is explained by the model. This is a suspiciously accurate model with near-perfect predictive power, largely because baseline weight alone is such a strong predictor of later weight. The ANOVA in turn shows that the mean weights differ by baseline weight and between the groups with statistical significance, and the p-value of the F-statistic shows that the fitted model is statistically significant overall.

# Calculate mean weights per group.
rats_group_mean <-
  rats %>%
  filter(Time > 1, ID != "12") %>%
  group_by(Group, ID) %>%
  summarize(Weight_mean = mean(Weight_g))

# Create data frame of baseline weights.
rats_baseline <-
  rats %>%
  dplyr::filter(Time == "1", ID != "12") %>%
  group_by(Group, ID) %>%
  summarize(Weight_mean_day1 = mean(Weight_g))

# Join data frames.
new_rats <- inner_join(rats_group_mean, rats_baseline, by = c("Group", "ID"))

# Fit new data to a linear model and print summary.
rats_fit <-
  lm(Weight_mean ~ Weight_mean_day1 + Group, data = new_rats)

summary(rats_fit)
## 
## Call:
## lm(formula = Weight_mean ~ Weight_mean_day1 + Group, data = new_rats)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.2818  -5.4247  -0.0667   5.7389  13.7887 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      119.8515    31.5069   3.804 0.002923 ** 
## Weight_mean_day1   0.5792     0.1251   4.630 0.000728 ***
## Group2            89.2652    22.0021   4.057 0.001892 ** 
## Group3           112.9571    32.7344   3.451 0.005421 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.749 on 11 degrees of freedom
## Multiple R-squared:  0.996,  Adjusted R-squared:  0.9949 
## F-statistic: 911.4 on 3 and 11 DF,  p-value: 1.842e-13
# Run ANOVA.
anova(rats_fit)
## Analysis of Variance Table
## 
## Response: Weight_mean
##                  Df Sum Sq Mean Sq   F value    Pr(>F)    
## Weight_mean_day1  1 207817  207817 2714.9357 1.601e-14 ***
## Group             2   1483     742    9.6878  0.003748 ** 
## Residuals        11    842      77                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

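With an R² above 0.99, it is worth eyeballing the residual diagnostics; base R's `plot()` method for `lm` objects draws the four standard plots. A self-contained sketch using built-in data (`mtcars` stands in for the rats data):

```r
# Sketch: standard regression diagnostics via plot.lm().
fit <- lm(mpg ~ wt + factor(cyl), data = mtcars)  # stand-in model
par(mfrow = c(2, 2))
plot(fit)  # residuals vs fitted, normal Q-Q, scale-location, leverage
```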
Lastly, as something extra, I wanted to see what the actual quantitative difference was between the rat groups' mean weights and their baseline weights. The differences are quite small, yet, based on the previous tests, statistically significant.

kable(
  new_rats %>% group_by(Group) %>% summarize(
    Weight_mean = mean(Weight_mean),
    Weight_mean_day1 = mean(Weight_mean_day1),
    Difference = round(Weight_mean - Weight_mean_day1, 2),
    "Difference (%)" = round((1 - Weight_mean_day1 / Weight_mean) * 100, 2)
  )
)
| Group | Weight_mean | Weight_mean_day1 | Difference | Difference (%) |
|:------|------------:|-----------------:|-----------:|---------------:|
| 1     |     265.025 |          250.625 |      14.40 |           5.43 |
| 2     |     452.400 |          420.000 |      32.40 |           7.16 |
| 3     |     527.500 |          508.750 |      18.75 |           3.55 |

Analyzing BPRS data

The data from Davis (2002) contain brief psychiatric rating scale (BPRS) measurements for 40 male subjects over a period of eight weeks (1–8), plus a baseline (week 0). Each symptom item on the scale is rated from one ("not present") to seven ("extremely severe"), and the scale is used to evaluate patients suspected of having schizophrenia. For this part of the exercises, I'll use a prepared dataset with five variables (treatment, subject, week_og, bprs, and week) and 360 observations; week_og is the original annotation for the measurement weeks. At a quick glance, nothing in the data summary stands out (to me at least).
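For reference, the wide-to-long conversion done in the earlier data wrangling step presumably looked roughly like this (the toy columns and values below are illustrative, not the real file):

```r
library(tidyr)
library(dplyr)

# Sketch: wide BPRS (one column per week) to long format.
bprs_wide <- data.frame(treatment = c(1, 2), subject = c(1, 1),
                        week0 = c(42, 52), week1 = c(36, 73))
bprs_long <- bprs_wide %>%
  pivot_longer(starts_with("week"),
               names_to = "week_og", values_to = "bprs") %>%
  mutate(week = as.integer(sub("week", "", week_og)))
```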

# Import prepared BPRS data.
bprs <-
  read.csv("Data/bprs.csv") %>%
  mutate(treatment = as.factor(treatment),
         subject = as.factor(subject),
         week = as.factor(week))

# Check data properties.
str(bprs)
## 'data.frame':    360 obs. of  5 variables:
##  $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ subject  : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ week_og  : chr  "week0" "week0" "week0" "week0" ...
##  $ bprs     : int  42 58 54 55 72 48 71 30 41 57 ...
##  $ week     : Factor w/ 9 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
# Check summary of data.
summary(bprs)
##  treatment    subject      week_og               bprs            week    
##  1:180     1      : 18   Length:360         Min.   :18.00   0      : 40  
##  2:180     2      : 18   Class :character   1st Qu.:27.00   1      : 40  
##            3      : 18   Mode  :character   Median :35.00   2      : 40  
##            4      : 18                      Mean   :37.66   3      : 40  
##            5      : 18                      3rd Qu.:43.00   4      : 40  
##            6      : 18                      Max.   :95.00   5      : 40  
##            (Other):252                                      (Other):120

Plotting the individual trajectories shows a rough downward trend in the BPRS values, along with considerable variation between subjects.

ggplot(bprs, aes(
  x = week,
  y = bprs,
  group = interaction(subject, treatment),
  color = treatment
)) +
  geom_line(linewidth = 1) +
  labs(x = "Week", y = "Brief psychiatric rating scale (BPRS)", color = "Treatment") +
  scale_color_manual(values = ptol_pal()(length(unique(bprs$treatment)))) +
  theme_bw() +
  theme(
    axis.title = element_text(face = "bold"),
    legend.title = element_text(face = "bold"),
    legend.text = element_text(size = 10),
    axis.text = element_text(size = 10)
  )

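A common companion view standardizes bprs within each week, which makes it easier to see whether a subject stays above or below that week's average; a sketch of the transformation on toy data (assuming dplyr, as used elsewhere in this chapter):

```r
library(dplyr)

# Sketch: z-score bprs within each week (toy values for illustration).
toy <- data.frame(week = rep(0:1, each = 3),
                  bprs = c(40, 50, 60, 30, 40, 50))
toy_std <- toy %>%
  group_by(week) %>%
  mutate(bprs_std = (bprs - mean(bprs)) / sd(bprs)) %>%
  ungroup()
```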
Fitting a linear model to these variables shows similar results. The estimates in the coefficients section are mostly negative, supporting a negative association between BPRS and week, and most are statistically significant, with the exception of week 1; the coefficient for treatment 2 is not significant either. The R² values show that ca. 20% of the variance in BPRS is explained by the predictors, so the fit is not great. The concluding F-statistic nevertheless gives a statistically significant overall p-value for the model. Note, however, that this model treats the repeated measurements on each subject as independent, which they are not.

bprs_reg <- lm(bprs ~ week + treatment, bprs)

summary(bprs_reg)
## 
## Call:
## lm(formula = bprs ~ week + treatment, data = bprs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.714  -9.226  -2.989   6.786  48.389 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  47.7139     2.0590  23.173  < 2e-16 ***
## week1        -1.6750     2.7624  -0.606  0.54467    
## week2        -6.3000     2.7624  -2.281  0.02317 *  
## week3        -8.8500     2.7624  -3.204  0.00148 ** 
## week4       -11.6500     2.7624  -4.217 3.15e-05 ***
## week5       -15.4500     2.7624  -5.593 4.51e-08 ***
## week6       -16.7750     2.7624  -6.073 3.28e-09 ***
## week7       -15.8000     2.7624  -5.720 2.29e-08 ***
## week8       -16.5750     2.7624  -6.000 4.92e-09 ***
## treatment2    0.5722     1.3022   0.439  0.66063    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.35 on 350 degrees of freedom
## Multiple R-squared:  0.2026, Adjusted R-squared:  0.182 
## F-statistic: 9.878 on 9 and 350 DF,  p-value: 1.62e-13

Next, I fit a linear mixed-effects model with a random intercept, which accounts for the fact that the repeated BPRS measurements on the same subject are not independent. The fixed effects echo the negative trend across weeks noted previously.

library(lme4)

# Create random intercept model with subject for the random effect.
bprs_ref <- lmer(bprs ~ week + treatment + (1 | subject), bprs, REML = FALSE)

# Print summary of the model.
summary(bprs_ref)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
##    Data: bprs
## 
##      AIC      BIC   logLik deviance df.resid 
##   2751.3   2798.0  -1363.7   2727.3      348 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.2309 -0.6213 -0.1018  0.4877  3.3537 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept)  47.6     6.899  
##  Residual             100.8    10.039  
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  47.7139     2.2758  20.965
## week1        -1.6750     2.2448  -0.746
## week2        -6.3000     2.2448  -2.807
## week3        -8.8500     2.2448  -3.942
## week4       -11.6500     2.2448  -5.190
## week5       -15.4500     2.2448  -6.883
## week6       -16.7750     2.2448  -7.473
## week7       -15.8000     2.2448  -7.039
## week8       -16.5750     2.2448  -7.384
## treatment2    0.5722     1.0582   0.541
## 
## Correlation of Fixed Effects:
##            (Intr) week1  week2  week3  week4  week5  week6  week7  week8 
## week1      -0.493                                                        
## week2      -0.493  0.500                                                 
## week3      -0.493  0.500  0.500                                          
## week4      -0.493  0.500  0.500  0.500                                   
## week5      -0.493  0.500  0.500  0.500  0.500                            
## week6      -0.493  0.500  0.500  0.500  0.500  0.500                     
## week7      -0.493  0.500  0.500  0.500  0.500  0.500  0.500              
## week8      -0.493  0.500  0.500  0.500  0.500  0.500  0.500  0.500       
## treatment2 -0.232  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000

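The variance components above also yield the intraclass correlation, the share of total variance attributable to between-subject differences; recomputing it from the printed estimates:

```r
# Intraclass correlation from the random-effects output above:
# subject (intercept) variance 47.6, residual variance 100.8.
var_subject  <- 47.6
var_residual <- 100.8
icc <- var_subject / (var_subject + var_residual)
round(icc, 2)  # about a third of the variance is between subjects
```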
Here, I do the same analysis as above but with the addition of a random slope, after which this and the previous model are compared with ANOVA (a likelihood-ratio test). The summary shows that the scaled residuals are distributed quite evenly, and the fixed-effect correlations are mostly negative. The ANOVA result, however, is a p-value of a (beautiful) 1: adding the random slopes does not significantly improve the fit over the random intercept model. Note also the singular-fit warning: since week is a factor with nine levels, the random slope term demands far more variance parameters than 20 subjects can support.

# Create random intercept and random slope model with week and subject.
bprs_ref1 <- lmer(bprs ~ week + treatment + (week | subject), bprs, REML = FALSE)

# Print summary of the model.
summary(bprs_ref1)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
##    Data: bprs
## 
##      AIC      BIC   logLik deviance df.resid 
##   2825.5   3043.1  -1356.7   2713.5      304 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0212 -0.5626 -0.0752  0.5371  3.3227 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr                               
##  subject  (Intercept) 51.940   7.207                                       
##           week1       11.027   3.321     0.33                              
##           week2        8.807   2.968     0.09  0.68                        
##           week3       15.513   3.939     0.00  0.92  0.83                  
##           week4       24.521   4.952    -0.45  0.66  0.37  0.78            
##           week5       30.318   5.506    -0.58  0.47  0.12  0.58  0.96      
##           week6       43.572   6.601    -0.53  0.48  0.08  0.57  0.96  1.00
##           week7       49.964   7.069    -0.41  0.57  0.09  0.61  0.95  0.98
##           week8       51.948   7.207    -0.33  0.64  0.15  0.66  0.96  0.96
##  Residual             90.733   9.525                                       
##             
##             
##             
##             
##             
##             
##             
##             
##   0.99      
##   0.98  1.00
##             
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  47.7139     2.2622  21.092
## week1        -1.6750     2.2557  -0.743
## week2        -6.3000     2.2309  -2.824
## week3        -8.8500     2.3048  -3.840
## week4       -11.6500     2.4006  -4.853
## week5       -15.4500     2.4602  -6.280
## week6       -16.7750     2.5914  -6.473
## week7       -15.8000     2.6523  -5.957
## week8       -16.5750     2.6710  -6.206
## treatment2    0.5722     1.0041   0.570
## 
## Correlation of Fixed Effects:
##            (Intr) week1  week2  week3  week4  week5  week6  week7  week8 
## week1      -0.367                                                        
## week2      -0.430  0.517                                                 
## week3      -0.434  0.552  0.535                                          
## week4      -0.566  0.519  0.474  0.548                                   
## week5      -0.613  0.485  0.431  0.512  0.606                            
## week6      -0.601  0.479  0.406  0.505  0.616  0.640                     
## week7      -0.551  0.490  0.400  0.510  0.618  0.640  0.666              
## week8      -0.518  0.504  0.407  0.522  0.620  0.636  0.663  0.678       
## treatment2 -0.222  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
## optimizer (nloptwrap) convergence code: 0 (OK)
## boundary (singular) fit: see help('isSingular')
# Run ANOVA on the models.
anova(bprs_ref1, bprs_ref)
## Data: bprs
## Models:
## bprs_ref: bprs ~ week + treatment + (1 | subject)
## bprs_ref1: bprs ~ week + treatment + (week | subject)
##           npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)
## bprs_ref    12 2751.3 2798.0 -1363.7   2727.3                     
## bprs_ref1   56 2825.5 3043.1 -1356.7   2713.5 13.837 44          1

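The singular-fit warning is no surprise: with week as a nine-level factor, `(week | subject)` asks for a full 9 × 9 random-effect covariance matrix. Counting its parameters:

```r
# Variance-covariance parameters for a k-dimensional random effect:
# k variances plus k * (k - 1) / 2 correlations.
k <- 9                       # intercept + 8 week contrasts
n_params <- k * (k + 1) / 2  # 45 parameters from only 20 subjects
```

Together with the 10 fixed effects and the residual variance, this accounts for the 56 parameters reported in the anova table above.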
For the final section, I fit the BPRS data with a random intercept and slope model that also includes the week × treatment interaction. The interaction estimates for treatment 2 are negative in weeks 1–5 and positive in weeks 6–8, but none are statistically significant. The ANOVA comparing this model with the previous one suggests that the interaction does not improve the fit (p-value 0.3598). Plotting the fitted values again shows a general downward trend from baseline until week 5, after which the fitted BPRS rises with treatment 2; treatment 1 shows more varied trajectories from week 5 onwards.

# Create random intercept and random slope model with the interaction.
bprs_ref2 <- lmer(bprs ~ week + treatment + week * treatment + (week | subject), bprs, REML = FALSE)

# Print summary of the model.
summary(bprs_ref2)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + week * treatment + (week | subject)
##    Data: bprs
## 
##      AIC      BIC   logLik deviance df.resid 
##   2832.7   3081.4  -1352.3   2704.7      296 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.1485 -0.5778 -0.0825  0.5463  3.4517 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr                               
##  subject  (Intercept) 53.16    7.291                                       
##           week1       11.76    3.429     0.29                              
##           week2       10.47    3.235     0.04  0.70                        
##           week3       17.69    4.206    -0.04  0.92  0.85                  
##           week4       26.32    5.130    -0.47  0.68  0.42  0.79            
##           week5       31.56    5.618    -0.59  0.48  0.16  0.59  0.96      
##           week6       45.07    6.713    -0.54  0.50  0.12  0.58  0.95  1.00
##           week7       51.52    7.178    -0.42  0.57  0.12  0.61  0.95  0.98
##           week8       53.68    7.327    -0.35  0.65  0.18  0.66  0.95  0.96
##  Residual             88.10    9.386                                       
##             
##             
##             
##             
##             
##             
##             
##             
##   0.99      
##   0.98  1.00
##             
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##                  Estimate Std. Error t value
## (Intercept)        47.000      2.658  17.685
## week1              -0.200      3.066  -0.065
## week2              -3.450      3.055  -1.129
## week3              -6.100      3.114  -1.959
## week4             -10.400      3.182  -3.268
## week5             -14.300      3.223  -4.437
## week6             -17.300      3.326  -5.201
## week7             -17.200      3.374  -5.097
## week8             -17.700      3.390  -5.221
## treatment2          2.000      2.968   0.674
## week1:treatment2   -2.950      4.198  -0.703
## week2:treatment2   -5.700      4.198  -1.358
## week3:treatment2   -5.500      4.198  -1.310
## week4:treatment2   -2.500      4.198  -0.596
## week5:treatment2   -2.300      4.198  -0.548
## week6:treatment2    1.050      4.198   0.250
## week7:treatment2    2.800      4.198   0.667
## week8:treatment2    2.250      4.198   0.536
## optimizer (nloptwrap) convergence code: 0 (OK)
## boundary (singular) fit: see help('isSingular')
# Run ANOVA on the models.
anova(bprs_ref2, bprs_ref1)
## Data: bprs
## Models:
## bprs_ref1: bprs ~ week + treatment + (week | subject)
## bprs_ref2: bprs ~ week + treatment + week * treatment + (week | subject)
##           npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)
## bprs_ref1   56 2825.5 3043.1 -1356.7   2713.5                     
## bprs_ref2   64 2832.7 3081.4 -1352.3   2704.7 8.7955  8     0.3598
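The chi-square statistic in this anova table is just the difference in deviances, and its p-value comes from a chi-square distribution with 8 degrees of freedom (64 minus 56 parameters); recomputing it by hand from the printed values:

```r
# Likelihood-ratio test by hand, using the deviances printed above.
chisq <- 2713.5 - 2704.7  # deviance difference, about 8.8
df    <- 64 - 56          # difference in parameter counts
p     <- pchisq(chisq, df = df, lower.tail = FALSE)
round(p, 2)  # about 0.36, matching the anova() p-value
```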
# Create a vector of the fitted values.
Fitted <- fitted(bprs_ref2)

# Add fitted values to data frame.
new_bprs <- bprs %>% mutate(fitted = Fitted)

# Plot data.
ggplot(new_bprs,
       aes(
         x = week,
         y = fitted,
         group = interaction(subject, treatment),
         color = treatment
       )) +
  geom_line(linewidth = 1) +
  labs(x = "Week", y = "Brief psychiatric rating scale (fitted)", color = "Treatment") +
  scale_color_manual(values = ptol_pal()(length(unique(bprs$treatment)))) +
  theme_bw() +
  theme(
    axis.title = element_text(face = "bold"),
    legend.title = element_text(face = "bold"),
    legend.text = element_text(size = 10),
    axis.text = element_text(size = 10)
  )


(More chapters to be added… Just kidding, it’s over!)